Abstract

In [2] M. Benaïm and G. Ben Arous solve a multi-armed bandit problem arising in
the theory of learning in games. We propose a short elementary proof of this result
based on a variant of the Kronecker Lemma.
In [2] a multi-armed bandit problem is addressed and investigated by M. Benaïm and
G. Ben Arous. Let $f_0, \dots, f_d$ denote $d+1$ real-valued continuous functions defined on
$[0,1]^{d+1}$. Given a sequence $x = (x_n)_{n\ge 1} \in \{0,\dots,d\}^{\mathbb{N}^*}$ (the strategy), set for every $n \ge 1$

$$\bar x_n := (\bar x_n^0, \bar x_n^1, \dots, \bar x_n^d) \quad\text{with}\quad \bar x_n^i := \frac{1}{n}\sum_{k=1}^{n} \mathbf{1}_{\{x_k = i\}}, \quad i = 0,\dots,d,$$

and

$$Q(x) = \liminf_{n\to+\infty} \frac{1}{n}\sum_{k=0}^{n-1} f_{x_{k+1}}(\bar x_k)$$

($\bar x_0 := (\bar x_0^0, \bar x_0^1, \dots, \bar x_0^d) \in [0,1]^{d+1}$, $\bar x_0^0 + \dots + \bar x_0^d = 1$, is a starting distribution). Imagine
$d+1$ players enrolled in a cooperative/competitive game with the following simple rules:
if player $i \in \{0,\dots,d\}$ plays at time $n$ he is rewarded by $f_i(\bar x_n)$, otherwise he gets nothing;
only one player can play at the same time. Then the sequence $x$ is a playing strategy for
the group of players and $Q(x)$ is the global cumulative worst payoff rate of the strategy
$x$ for the whole community of players (regardless of the cumulative payoff rate of each
player).
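The definitions above can be illustrated by a minimal Python sketch. The function name `average_payoff`, the two sample reward functions and the four-step strategy are illustrative assumptions, not taken from [2]; the code only evaluates the finite-horizon average $\frac{1}{n}\sum_{k=0}^{n-1} f_{x_{k+1}}(\bar x_k)$ whose lower limit defines $Q(x)$.

```python
def average_payoff(strategy, rewards, start):
    """Return ((1/n) * sum_{k=0}^{n-1} f_{x_{k+1}}(xbar_k), xbar_n).

    strategy : list of player indices x_1, ..., x_n in {0, ..., d}
    rewards  : list of d+1 functions f_i taking a frequency vector
    start    : starting distribution xbar_0 (entries sum to 1)
    """
    counts = [0] * len(rewards)
    xbar = list(start)                        # xbar_0
    total = 0.0
    for k, player in enumerate(strategy):     # player is x_{k+1}
        total += rewards[player](xbar)        # reward f_{x_{k+1}}(xbar_k)
        counts[player] += 1
        xbar = [c / (k + 1) for c in counts]  # empirical frequencies xbar_{k+1}
    return total / len(strategy), xbar


# Hypothetical two-player example: f_0(u) = u_1 and f_1(u) = 3 * u_0.
f0 = lambda u: 1.0 * u[1]
f1 = lambda u: 3.0 * u[0]
avg, xbar = average_payoff([0, 1, 0, 1], [f0, f1], [0.5, 0.5])
print(avg, xbar)
```

On such a short horizon the value is only a finite-$n$ average; $Q(x)$ itself is the lower limit as $n \to +\infty$.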

In [2] an answer (see Theorem 1 below) is provided to the following question:
What are the good strategies (for the group)?
The authors rely on some recent tools developed in stochastic approximation theory (see
e.g. [1]). The aim of this note is to provide an elementary and shorter proof based on a 
slight improvement of the Kronecker Lemma. 

Let $S_d := \{v \in [0,1]^d,\ \sum_{i=1}^d v_i \le 1\}$ and $\mathcal{P}_{d+1} := \{u \in [0,1]^{d+1},\ \sum_{i=0}^d u_i = 1\}$. Furthermore,
for notational convenience, set

$$\forall\, v = (v_1,\dots,v_d) \in S_d, \qquad \tilde v := \Big(1 - \sum_{i=1}^d v_i,\ v_1, \dots, v_d\Big) \in \mathcal{P}_{d+1},$$

$$\forall\, u = (u_0, u_1,\dots,u_d) \in \mathcal{P}_{d+1}, \qquad \hat u := (u_1,\dots,u_d) \in S_d.$$

The canonical inner product on $\mathbb{R}^d$ will be denoted by $(v|v') = \sum_{i=1}^d v_i v'_i$. The interior of
a subset $A$ of $\mathbb{R}^d$ will be denoted $\mathring{A}$. For a sequence $u = (u_n)_{n\ge 0}$, $\Delta u_n := u_n - u_{n-1}$, $n \ge 1$.

The main result is the following theorem (first established in [2]). 

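Theorem 1 Assume there is a function $\Phi : S_d \to \mathbb{R}$, continuously differentiable on $\mathring{S}_d$,
with gradient having a continuous extension $\nabla\Phi$ on $S_d$ and satisfying:

$$\forall\, v \in S_d, \qquad \nabla\Phi(v) = \big(f_i(\tilde v) - f_0(\tilde v)\big)_{1\le i\le d}. \qquad (1)$$

Set for every $u \in \mathcal{P}_{d+1}$,

$$q(u) := \sum_{i=0}^{d} u_i f_i(u)$$

and $Q^* := \max\{q(u),\ u \in \mathcal{P}_{d+1}\}$. Then, for every strategy $x \in \{0,1,\dots,d\}^{\mathbb{N}^*}$,

$$Q(x) \le Q^*.$$

Furthermore, for any strategy $x$ such that $\bar x_n \to \bar x_\infty$,

$$\frac{1}{n}\sum_{k=1}^{n} f_{x_{k+1}}(\bar x_k) \to q(\bar x_\infty) \quad\text{as } n \to \infty \quad (\text{so that } Q(x) = q(\bar x_\infty)).$$

In particular there is no better strategy than choosing the player at random according to
an i.i.d. strategy with distribution $x^* \in \operatorname{argmax} q$.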

The key of the proof is the following slight extension of the Kronecker Lemma. 

Lemma 1 ("à la Kronecker" Lemma) Let $(b_n)_{n\ge 1}$ be a nondecreasing sequence of positive
real numbers converging to $+\infty$ and let $(a_n)_{n\ge 1}$ be a sequence of real numbers. Then
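
$$\liminf_{n\to+\infty} \sum_{k=1}^{n} \frac{a_k}{b_k} \in \mathbb{R} \ \Longrightarrow\ \liminf_{n\to+\infty} \frac{1}{b_n}\sum_{k=1}^{n} a_k \le 0.$$

Proof. Set $C_n := \sum_{k=1}^{n} \frac{a_k}{b_k}$, $n \ge 1$, and $C_0 = 0$ so that $a_n = b_n \Delta C_n$. As a consequence, an
Abel transform yields

$$\frac{1}{b_n}\sum_{k=1}^{n} a_k = \frac{1}{b_n}\sum_{k=1}^{n} b_k \Delta C_k = \frac{1}{b_n}\Big(b_n C_n - \sum_{k=1}^{n} C_{k-1}\Delta b_k\Big) = C_n - \frac{1}{b_n}\sum_{k=1}^{n} C_{k-1}\Delta b_k.$$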

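Now, $\liminf_{n\to+\infty} C_n$ being finite, for every $\varepsilon > 0$, there is an integer $n_\varepsilon$ such that for every
$k \ge n_\varepsilon$, $C_k \ge \liminf_{n\to+\infty} C_n - \varepsilon$. Hence, for every $n > n_\varepsilon$,

$$\frac{1}{b_n}\sum_{k=1}^{n} C_{k-1}\Delta b_k \ \ge\ \frac{1}{b_n}\sum_{k=1}^{n_\varepsilon} C_{k-1}\Delta b_k + \frac{b_n - b_{n_\varepsilon}}{b_n}\Big(\liminf_{k\to+\infty} C_k - \varepsilon\Big).$$

Consequently, $\liminf_{n\to+\infty} C_n$ being finite and $b_n$ going to $+\infty$, one concludes that

$$\liminf_{n\to+\infty} \frac{1}{b_n}\sum_{k=1}^{n} a_k \ \le\ \liminf_{n\to+\infty} C_n - 1 \times \Big(\liminf_{k\to+\infty} C_k - \varepsilon\Big) = \varepsilon.$$

Since $\varepsilon$ is arbitrary, the result follows.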
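The Abel transform used in this proof can be checked in exact arithmetic with a small Python sketch. The helper name `abel_sides` and the sample sequences are illustrative assumptions; the convention $b_0 = 0$ is harmless here because $C_0 = 0$ cancels the corresponding term.

```python
from fractions import Fraction


def abel_sides(a, b):
    """Return both sides of the identity
    (1/b_n) sum_{k<=n} a_k = C_n - (1/b_n) sum_{k<=n} C_{k-1} (b_k - b_{k-1}),
    where C_k = sum_{j<=k} a_j / b_j and C_0 = 0."""
    n = len(a)
    C = [Fraction(0)]                                # C_0 = 0
    for ak, bk in zip(a, b):
        C.append(C[-1] + Fraction(ak) / Fraction(bk))
    b_ext = [Fraction(0)] + [Fraction(v) for v in b]  # assumed convention b_0 = 0
    lhs = sum(Fraction(ak) for ak in a) / b_ext[n]
    correction = sum(C[k - 1] * (b_ext[k] - b_ext[k - 1]) for k in range(1, n + 1))
    rhs = C[n] - correction / b_ext[n]
    return lhs, rhs


lhs, rhs = abel_sides([3, -1, 4, -1, 5], [1, 2, 3, 4, 5])
print(lhs == rhs)
```

Using `Fraction` avoids rounding, so the two sides agree exactly rather than up to a tolerance.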

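Proof of Theorem 1. First note that for every $u = (u_0, \dots, u_d) \in \mathcal{P}_{d+1}$,

$$q(u) := \sum_{i=0}^{d} u_i f_i(u) = f_0(u) + \sum_{i=1}^{d} u_i \big(f_i(u) - f_0(u)\big)$$

so that

$$Q^* = \sup_{v\in S_d}\Big\{ f_0(\tilde v) + \sum_{i=1}^{d} v_i \big(f_i(\tilde v) - f_0(\tilde v)\big) \Big\} = \sup_{v \in S_d}\big\{ f_0(\tilde v) + (v|\nabla\Phi(v)) \big\}.$$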

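Now, for every $k \ge 0$,

$$f_{x_{k+1}}(\bar x_k) - q(\bar x_k) = \sum_{i=0}^{d} \big(f_i(\bar x_k)\mathbf{1}_{\{x_{k+1}=i\}} - \bar x_k^i f_i(\bar x_k)\big) = \sum_{i=0}^{d} f_i(\bar x_k)\big(\mathbf{1}_{\{x_{k+1}=i\}} - \bar x_k^i\big)$$

$$= (k+1)\sum_{i=0}^{d} f_i(\bar x_k)\,\Delta \bar x_{k+1}^i$$

$$= (k+1)\sum_{i=1}^{d} \big(f_i(\bar x_k) - f_0(\bar x_k)\big)\,\Delta \bar x_{k+1}^i,$$

where the last step uses $\sum_{i=0}^{d} \Delta\bar x_{k+1}^i = 0$.

The last equality reads, using Assumption (1) and writing $\hat x_k := (\bar x_k^1, \dots, \bar x_k^d) \in S_d$,

$$f_{x_{k+1}}(\bar x_k) - q(\bar x_k) = (k+1)\big(\nabla\Phi(\hat x_k)\,|\,\Delta \hat x_{k+1}\big).$$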



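Consequently, by the fundamental formula of calculus applied to $\Phi$ on $(\hat x_k, \hat x_{k+1}) \subset S_d$ (recall that $\hat x_k = (\bar x_k^1, \dots, \bar x_k^d)$),

$$\frac{1}{n}\sum_{k=0}^{n-1} \big(f_{x_{k+1}}(\bar x_k) - q(\bar x_k)\big) = \frac{1}{n}\sum_{k=0}^{n-1} (k+1)\big(\Phi(\hat x_{k+1}) - \Phi(\hat x_k)\big) - R_n$$

$$\text{with}\quad R_n := \frac{1}{n}\sum_{k=0}^{n-1} \big(\nabla\Phi(\xi_k) - \nabla\Phi(\hat x_k)\,\big|\,(k+1)\Delta \hat x_{k+1}\big)$$

and $\xi_k \in (\hat x_k, \hat x_{k+1})$, $k = 0, \dots, n-1$. The fact that $|(k+1)\Delta \hat x_{k+1}| \le 1$ implies

$$|R_n| \le \frac{1}{n}\sum_{k=0}^{n-1} w\big(\nabla\Phi, |\Delta \hat x_{k+1}|\big)$$

where $w(g, \delta)$ denotes the uniform continuity $\delta$-modulus of a function $g$. One derives from
the uniform continuity of $\nabla\Phi$ on the compact set $S_d$ that

$$R_n \to 0 \quad\text{as } n \to +\infty.$$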

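Finally, the continuous function $\Phi$ being bounded on the compact set $S_d$, the partial sums

$$\sum_{k=0}^{n-1} \big(\Phi(\hat x_{k+1}) - \Phi(\hat x_k)\big) = \Phi(\hat x_n) - \Phi(\hat x_0)$$

remain bounded as $n$ goes to infinity. Lemma 1 then implies that

$$\liminf_{n\to+\infty} \frac{1}{n}\sum_{k=0}^{n-1} (k+1)\big(\Phi(\hat x_{k+1}) - \Phi(\hat x_k)\big) \le 0.$$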

One concludes by noting that on one hand

$$\limsup_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} q(\bar x_k) \le Q^* = \sup_{\mathcal{P}_{d+1}} q$$

and that, on the other hand, the function $q$ being continuous,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} q(\bar x_k) = q(x^*) \quad\text{as soon as } \bar x_n \to x^*. \qquad \square$$

Corollary 1 When $d+1 = 2$ (two players), Assumption (1) is satisfied as soon as $f_0$ and
$f_1$ are continuous on $\mathcal{P}_2$ and then the conclusions of Theorem 1 hold true.

Proof: This follows from the obvious fact that the continuous function $u_1 \mapsto f_1(1 - u_1, u_1) - f_0(1 - u_1, u_1)$
on $[0,1]$ has an antiderivative. $\square$

Further comments: • If one considers a slightly more general game in which some
weighted strategies are allowed, the final result is not modified in any way provided the
weight sequence satisfies a very light assumption. Namely, assume that at time $n$ the
reward is

$$\lambda_{n+1} f_{x_{n+1}}(\bar x_n) \quad\text{instead of}\quad f_{x_{n+1}}(\bar x_n)$$

where the weight sequence $\lambda = (\lambda_n)_{n\ge 1}$ satisfies

$$\lambda_n > 0,\ n \ge 1, \qquad S_n = \sum_{k=1}^{n} \lambda_k \to +\infty, \qquad \frac{\lambda_n}{S_n} \to 0 \quad\text{as } n \to \infty;$$

then the quantities $\bar x_n^{\lambda} \in \mathcal{P}_{d+1}$, $\bar x_n^{\lambda} := (\bar x_n^{\lambda,0}, \dots, \bar x_n^{\lambda,d})$ with $\bar x_n^{\lambda,i} = \frac{1}{S_n}\sum_{k=1}^{n} \lambda_k \mathbf{1}_{\{x_k = i\}}$,
$i = 0, \dots, d$, $n \ge 1$, and $Q^{\lambda}(x) = \liminf_{n\to+\infty} \frac{1}{S_n}\sum_{k=0}^{n-1} \lambda_{k+1} f_{x_{k+1}}(\bar x_k^{\lambda})$ satisfy all the conclusions of
Theorem 1 mutatis mutandis.
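As a rough illustration of the weighted empirical frequencies, here is a minimal Python sketch. The function name `weighted_frequencies` and the sample weights are assumptions made for the example; with all weights equal to 1 it reduces to the plain frequencies $\bar x_n$.

```python
def weighted_frequencies(strategy, weights, d1):
    """Return the list of weighted frequency vectors, one per time n >= 1:
    entry i of the n-th vector is (1/S_n) * sum_{k<=n} w_k * 1{x_k = i},
    where S_n = w_1 + ... + w_n and the weights are assumed positive."""
    S = 0.0
    acc = [0.0] * d1           # accumulated weight credited to each player
    out = []
    for player, w in zip(strategy, weights):
        S += w
        acc[player] += w
        out.append([v / S for v in acc])
    return out


# Hypothetical example: two players, increasing weights 1, 2, 3.
freqs = weighted_frequencies([0, 1, 0], [1.0, 2.0, 3.0], 2)
print(freqs[-1])
```

Each returned vector sums to 1, so it stays in the simplex of probability vectors on the $d+1$ players.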

• Several applications of Theorem 1 to the theory of learning in games and to stochastic 
fictitious play are extensively investigated in [2] which we refer to for all these aspects. As 
far as we are concerned we will simply make a remark about some "natural" strategies 
which illustrates the theorem in an elementary way. 

In the reward function at time $k$, i.e. $f_{x_k}(\bar x_{k-1})$, $x_k$ represents the competitive term
("who will play?") and $\bar x_{k-1}$ represents a cooperative term (everybody's past behaviour
has influence on everybody's reward).

This cooperative/competitive antagonism implies that in such a game a greedy competitive
strategy is usually not optimal (when the players do not play a symmetric role).
Let us be more specific. Assume for the sake of simplicity that $d + 1 = 2$ (two players).
Then one may consider without loss of generality that $\bar x_n = \bar x_n^1$, i.e. that $\bar x_n$ is a $[0,1]$-valued
real number. A greedy competitive strategy is defined by

$$\text{player 1 plays at time } n \text{ (i.e. } x_n = 1) \quad\text{iff}\quad f_1(\bar x_{n-1}) > f_0(\bar x_{n-1}) \qquad (2)$$

i.e. the player with the highest reward is nominated to play. Note that such a strategy is
anticipative from a probabilistic viewpoint. Then, for every $n \ge 1$,

$$f_{x_n}(\bar x_{n-1}) = \max\big(f_0(\bar x_{n-1}), f_1(\bar x_{n-1})\big)$$

and it is clear that

$$f_{x_n}(\bar x_{n-1}) - q(\bar x_{n-1}) = \max\big(f_0(\bar x_{n-1}), f_1(\bar x_{n-1})\big) - q(\bar x_{n-1}) =: \psi(\bar x_{n-1}) \ge 0.$$

On the other hand, the proof of Theorem 1 implies that

$$\liminf_{n\to+\infty} \frac{1}{n}\sum_{k=0}^{n-1} \psi(\bar x_k) \le 0.$$

Hence, there is at least one weak limiting distribution $\mu_\infty$ of the sequence of empirical
measures $\mu_n := \frac{1}{n}\sum_{0\le k\le n-1} \delta_{\bar x_k}$ which is supported by the closed set $\{\psi = 0\} \subset \{0,1\} \cup
\{f_0 = f_1\}$; on the other hand, $\operatorname{supp}(\mu_\infty)$ is contained in the set $\mathcal{X}_\infty$ of the limiting values of the
sequence $(\bar x_n)$ itself (in fact $\mathcal{X}_\infty$ is an interval since $(\bar x_n)_n$ is bounded and $\bar x_{n+1} - \bar x_n \to 0$).
Hence $\mathcal{X}_\infty \cap (\{0,1\} \cup \{f_0 = f_1\}) \neq \emptyset$.

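If the greedy strategy $(x_n)_n$ is optimal then $\operatorname{dist}(\bar x_n, \operatorname{argmax} q) \to 0$ as $n \to \infty$, i.e.
$\mathcal{X}_\infty \subset \operatorname{argmax} q$. Consequently if

$$\operatorname{argmax} q \cap (\{0,1\} \cup \{f_0 = f_1\}) = \emptyset \qquad (3)$$

then the purely competitive strategy is never optimal.
This is the case if

$$f_0(x) = a x \quad\text{and}\quad f_1(x) = b(1 - x), \quad x \in [0,1],$$

for some positive parameters $a \neq b$: then $q(x) = (a+b)\,x(1-x)$, so that

$$\operatorname{argmax} q = \{1/2\} \quad\text{and}\quad f_0(1/2) \neq f_1(1/2).$$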

In fact, one shows that the greedy strategy $x = (x_n)_{n\ge 1}$ defined by (2) satisfies

$$\bar x_n \to \frac{b}{a+b} \quad\text{and}\quad Q(x) = \frac{ab}{a+b} \quad\text{as } n \to \infty$$

whereas any optimal (cooperative) strategy (like the i.i.d. Bernoulli(1/2) one) yields an
asymptotic (relative) global payoff rate

$$Q^* = \max_{[0,1]} q = \frac{a+b}{4}.$$

Note that $Q^* > Q(x)$ since $a \neq b$. (When $a = b$ the greedy strategy becomes optimal.)
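The gap between the greedy and the cooperative payoff rates can be observed numerically. The Python sketch below is only an illustration under stated assumptions: the values $a = 1$, $b = 3$, the horizon, the starting frequency $\bar x_0 = 1/2$, and the deterministic alternating strategy (used here as a stand-in whose frequencies converge to $1/2$, instead of the i.i.d. Bernoulli(1/2) strategy) are all choices made for the example.

```python
def simulate(choose, a, b, n, x0=0.5):
    """Average payoff (1/n) * sum_k f_{x_{k+1}}(xbar_k) and final frequency
    of player 1, for f_0(x) = a*x and f_1(x) = b*(1-x)."""
    ones = 0            # number of times player 1 has played so far
    xbar = x0           # xbar_0: assumed starting frequency of player 1
    total = 0.0
    for k in range(n):
        player = choose(k, xbar)                       # x_{k+1}
        total += b * (1.0 - xbar) if player == 1 else a * xbar
        ones += player
        xbar = ones / (k + 1)                          # xbar_{k+1}
    return total / n, xbar


def greedy(a, b):
    # Rule (2) with a strict inequality: player 1 plays iff f_1 > f_0.
    return lambda k, xbar: 1 if b * (1.0 - xbar) > a * xbar else 0


def alternating(k, xbar):
    # Players 0 and 1 take turns, so the frequency of player 1 tends to 1/2.
    return k % 2


a, b, n = 1.0, 3.0, 20000
greedy_rate, greedy_freq = simulate(greedy(a, b), a, b, n)
coop_rate, coop_freq = simulate(alternating, a, b, n)
print(greedy_rate, greedy_freq)   # close to a*b/(a+b) and b/(a+b)
print(coop_rate, coop_freq)       # close to (a+b)/4 and 1/2
```

With these parameters the greedy rate stays near $ab/(a+b) = 0.75$ while the alternating strategy approaches $(a+b)/4 = 1$, in line with the comparison $(a+b)^2 > 4ab$ for $a \neq b$.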

• A more abstract version of Theorem 1 can be established using the same approach.
The finite set $\{0,1,\dots,d\}$ is replaced by a compact metric set $K$, $\mathcal{P}_{d+1}$ is replaced by the
convex set $\mathcal{P}_K$ of probability distributions on $K$ equipped with the weak topology and the
continuous function $f : K \times \mathcal{P}_K \to \mathbb{R}$ still derives from a potential function in some sense.